Anomaly Detection: Moving Z-Score and Bayesian Changepoints Model

Introductory Remarks

Anomalies are data points that are different from other observations in some way, typically measured against a model fit to the data. On the contrary with the ordinary descriptive statistics, we are interested here to found where these anomalous data points exist and not exclude them as outliers.

We assume the anomaly detection task is unsupervised, i.e. we don’t have training data with points labeled as anomalous. Each data point passed to an anomaly detection model is given a score indicating how different the point is relative to the rest of the dataset. The calculation of this score varies between models, but a higher score always indicates a point is more anomalous. Often a threshold is chosen to make a final classification of each point as typical or anomalous; this post-processing step is left to the user.

The GraphLab Create (GLC) Anomaly Detection toolkit currently includes three models for two different data contexts:

  • Local Outlier Factor, for detecting outliers in multivariate data that are assumed to be independently and identically distributed,
  • Moving Z-score, for scoring outliers in a univariate, sequential dataset, typically a time series, and
  • Bayesian Changepoints for identifying changes in the mean or variance of a sequential series.

In this short note, we demonstrate how the Moving Z-Score and Bayesian Changepoints models can be used to reveal anomalies in a time series object. As an example we are going to use the "Crude Oil Prices: Brent - Europe" time series, FRED-DCOILBRENTEU, as it is currently provided by the Quandl database of finance and economic data and the Federal Reserve Bank of St. Luis. This times series covers the daily closing prices of Crude Oil Brent - Europe (Dollars per Barrel, Not Seasonally Adjusted) starting from May 1987 to May 2016. It follows a pretty volatile behavior across the years, and we hope to find out where the most anomalous spot values are. For notes and definitions, please see the corresponding US Energy Information Agency (eia), Explanatory Notes.

The GLC Moving Z-Score Model

In a first step of our analysis, we are going to use the GLC Moving Z-Score implementation. This unsupervised learning model fits a moving average to a univariate time series and identifying that way points that are far from the fitted curve. The MovingZScoreModel works with either TimeSeries or SFrame inputs. A uniform sampling rate is assumed and the data window must be defined in terms of number of observations.

The moving Z-score for a data point $x_{t}$ is simply the value of $x_{t}$ standardized by subtracting the moving mean just prior to time $t$ and dividing by the moving standard deviation which is calculated for the same time interval. In particular, assuming that $w$ stands for the window_size in terms of the number of observations the moving Z-score is defined as:

\begin{equation*} z(x_{t}) = \frac{x_{t}-\bar{x}_{t}}{s_{t}}, \end{equation*}

where the moving average is:

\begin{equation*} \bar{x}_{t} = (1/w)\,\sum_{i=t-w}^{t-1}x_{i}, \end{equation*}

and the standard deviation for the same time interval:

\begin{equation*} s_{t} = \sqrt{(1/w)\,\sum_{i=t-w}^{t-1}(x_{i}-\bar{x}_{t})^{2}}. \end{equation*}

Notes:

  1. The moving Z-score at points within the window_size observations of the beginning of a series are not defined, because there are insufficient points to compute the moving average and moving standard deviation. This is represented by missing values.

  2. Missing values in the input dataset are assigned missing values (‘None’) for their anomaly scores as well.

  3. If there is no variation in the values preceding a given observation, the moving Z-score can be infinite or undefined. If the given observation is equal to the moving average, the anomaly score is coded as 'nan'; if the observation is not equal to the moving average, the anomaly score is 'inf'.

The GLC Bayesian Changepoints Model

As a next step of our analysis we are going to use the GLC Bayesian Changepoints model and compare the results of these two methods. The Bayesian Changepoints implementation scores changepoint probability in a univariate sequential dataset, often a time series. Changepoints are abrupt changes in the mean or variance of a time series. For instance, during an economic recession, stock values might suddenly drop to a very low value. The time at which the stock value dropped is called a changepoint.

The Bayesian Changepoints model is an implementation of the Bayesian Online Changepoint Detection algorithm developed by Ryan Adams and David MacKay. This algorithm computes a probability distribution over the possible run lengths at each point in the data, where run length refers to the number of observations since the last changepoint. When the probability of a 0-length run spikes, there is most likely a change point at the current data point.

More specifically, the algorithm follows the procedure below:

Step 1: Observe new datum $x_{t}$ and evaluate the likelihood of seeing this value for each possible run length. This is a probability vector, with an element for all possible run lengths. A Gaussian distribution between each pair of changepoints is assumed.

\begin{equation*} L(r)= P(x|x_{r}) \end{equation*}

Step 2: For each possible run length, $r>0$, at current time $t$, calculate the probability of growth. expected_runlength is a parameter describing the a-priori best guess of run length. The larger expected_runlength is, the stronger the evidence must be in the data to support a high changepoint probability.

\begin{equation*} P_{t}(runlength\equiv r) = P_{t-1}(runlength\equiv r-1)\ast L(r)\ast \left(1-\frac{1}{{expected\_runlength}}\right) \end{equation*}

Step 3: Calculate probability of change, or $r=0$.

\begin{equation*} P_{t}(runlength\equiv 0)= \sum_{r_{prev}}\left[P_{t−1}(runlength\equiv r_{prev})\ast L(0)\ast \left(\frac{1}{expected\_runlength}\right)\right] \end{equation*}

Step 4: Normalize the probability. For all run length probabilities at time $t$, divide by the sum of all run length probabilities.

\begin{equation*} P_{t}(runlength\equiv r_{i})=\frac{P_{t}(runlength\equiv r_{i})}{\sum_{r}P_{t}(runlength\equiv r)} \end{equation*}

For each incoming point, this process is repeated.

This per-point update is why the method is considered an online learning algorithm.

As described, the algorithm scores each point $x_{t}$ immediately, but if the user can afford to wait several observations, it is often more accurate to assign lagged changepoint scores. The number of observations to wait before scoring a point is set with the lag parameter.

Libraries and Necessary Data Transformation

First we fire up GraphLab Create, all the other necessary libraries for our study and load the FRED/DCOILBRENTEU data set in an SFrame.


In [1]:
import graphlab as gl
import matplotlib.pyplot as plt


[INFO] graphlab.cython.cy_server: GraphLab Create v1.10.1 started. Logging: /tmp/graphlab_server_1466491373.log
INFO:graphlab.cython.cy_server:GraphLab Create v1.10.1 started. Logging: /tmp/graphlab_server_1466491373.log
This non-commercial license of GraphLab Create is assigned to tgrammat@gmail.com and will expire on September 21, 2016. For commercial licensing options, visit https://dato.com/buy/.

In [2]:
fred_dcoilbrenteu = gl.SFrame.read_csv('./FRED-DCOILBRENTEU.csv')


Finished parsing file /home/theod/Documents/ML_Home/12.DatoPy/01.R.Anomaly_Detection/FRED-DCOILBRENTEU.csv
Parsing completed. Parsed 100 lines in 0.472685 secs.
Finished parsing file /home/theod/Documents/ML_Home/12.DatoPy/01.R.Anomaly_Detection/FRED-DCOILBRENTEU.csv
Parsing completed. Parsed 7345 lines in 0.258323 secs.
------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
fred_dcoilbrenteu


Out[3]:
DATE VALUE
2016-05-02 45.82
2016-04-29 45.64
2016-04-28 45.6
2016-04-27 44.17
2016-04-26 43.94
2016-04-25 42.97
2016-04-22 43.97
2016-04-21 43.48
2016-04-20 43.09
2016-04-19 43.02
[7345 rows x 2 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Next we transform the DATE column in an appropriate timestamp format, and the fred_dcoilbrenteu SFrame in a TimeSeries object.


In [4]:
import time
import dateutil

def _unix_timestamp_to_datetime(x):
    import datetime
    import pytz
    return dateutil.parser.parse(x)

fred_dcoilbrenteu['DATE'] = fred_dcoilbrenteu['DATE'].apply(_unix_timestamp_to_datetime)
fred_dcoilbrenteu = gl.TimeSeries(fred_dcoilbrenteu, index='DATE')
fred_dcoilbrenteu


Out[4]:
DATE VALUE
1987-05-20 00:00:00 18.63
1987-05-21 00:00:00 18.45
1987-05-22 00:00:00 18.55
1987-05-25 00:00:00 18.6
1987-05-26 00:00:00 18.63
1987-05-27 00:00:00 18.6
1987-05-28 00:00:00 18.6
1987-05-29 00:00:00 18.58
1987-06-01 00:00:00 18.65
1987-06-02 00:00:00 18.68
[7345 rows x 2 columns]
Note: Only the head of the TimeSeries is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

We can plot the fred_dcoilbrenteu time series set as follows.


In [6]:
%matplotlib inline

def plot_time_series(timestamp, values, title, **kwargs):
    plt.rcParams['figure.figsize'] = 14, 7
    plt.plot_date(timestamp, values, fmt='g-', tz='utc', **kwargs)
    plt.title(title)
    plt.xlabel('Year')
    plt.ylabel('Dollars per Barrel')
    plt.rcParams.update({'font.size': 16})
    
plot_time_series(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
                 'Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')


Training a Moving Z-Score Model

In this section we train a Moving Z-Score model to reveal where any anomalies exist in the fred_dcoilbrenteu time series.


In [7]:
window_size = 252 # average trading days per year 
model_moving_zscore =gl.anomaly_detection.moving_zscore.create(fred_dcoilbrenteu, 
                                                               window_size, feature='VALUE')

The primary output of the Moving Z-score model is the scores field. This TimeSeries object contains:

  • row id/time: ID of the corresponding row in the input dataset. Here the dataset is a TimeSeries object and the model returns the DATE timestamp. If it was an SFrame, this column would be filled with the row numbers of the input data.
  • anomaly score: absolute value of the moving Z-score. A score of 0 indicates that the value is identical to the moving average. The higher the score, the more likely a point is to be an anomaly.
  • VALUE: the recorded value of Dollars per Barrel of "Crude Oil Brent - Europe".
  • model update time: time that the model was updated. This is particularly useful for model updating.

In [8]:
scores = model_moving_zscore.scores.to_sframe()
scores.print_rows(num_rows=10, max_row_width=100)


+---------------------+---------------+-------+----------------+----------------------------+
|         DATE        | anomaly_score | VALUE | moving_average |     model_update_time      |
+---------------------+---------------+-------+----------------+----------------------------+
| 1987-05-20 00:00:00 |      None     | 18.63 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-21 00:00:00 |      None     | 18.45 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-22 00:00:00 |      None     | 18.55 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-25 00:00:00 |      None     |  18.6 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-26 00:00:00 |      None     | 18.63 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-27 00:00:00 |      None     |  18.6 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-28 00:00:00 |      None     |  18.6 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-05-29 00:00:00 |      None     | 18.58 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-06-01 00:00:00 |      None     | 18.65 |      None      | 2016-06-21 09:52:23.887247 |
| 1987-06-02 00:00:00 |      None     | 18.68 |      None      | 2016-06-21 09:52:23.887247 |
+---------------------+---------------+-------+----------------+----------------------------+
[7345 rows x 5 columns]


In [9]:
scores[252-10:252+10].print_rows(num_rows=60, max_row_width=100)


+---------------------+----------------+-------+----------------+----------------------------+
|         DATE        | anomaly_score  | VALUE | moving_average |     model_update_time      |
+---------------------+----------------+-------+----------------+----------------------------+
| 1988-05-03 00:00:00 |      None      | 16.08 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-04 00:00:00 |      None      | 16.15 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-05 00:00:00 |      None      | 16.15 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-06 00:00:00 |      None      | 16.45 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-09 00:00:00 |      None      |  16.5 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-10 00:00:00 |      None      | 16.38 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-11 00:00:00 |      None      | 16.48 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-12 00:00:00 |      None      |  16.4 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-13 00:00:00 |      None      |  16.5 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-16 00:00:00 |      None      |  16.6 |      None      | 2016-06-21 09:52:23.887247 |
| 1988-05-17 00:00:00 | 0.618129451322 |  16.6 | 17.5782142857  | 2016-06-21 09:52:23.887247 |
| 1988-05-18 00:00:00 | 0.739515418384 |  16.4 | 17.5701587302  | 2016-06-21 09:52:23.887247 |
| 1988-05-19 00:00:00 | 0.828791286959 | 16.25 | 17.5620238095  | 2016-06-21 09:52:23.887247 |
| 1988-05-20 00:00:00 | 0.696288845646 | 16.45 | 17.5528968254  | 2016-06-21 09:52:23.887247 |
| 1988-05-23 00:00:00 | 0.829727849382 | 16.23 | 17.5443650794  | 2016-06-21 09:52:23.887247 |
| 1988-05-24 00:00:00 | 0.779202267777 |  16.3 | 17.5348412698  | 2016-06-21 09:52:23.887247 |
| 1988-05-25 00:00:00 | 0.848917075781 | 16.18 | 17.5257142857  | 2016-06-21 09:52:23.887247 |
| 1988-05-26 00:00:00 | 0.842437746788 | 16.18 | 17.5161111111  | 2016-06-21 09:52:23.887247 |
| 1988-05-27 00:00:00 | 0.791903088125 | 16.25 | 17.5065873016  | 2016-06-21 09:52:23.887247 |
| 1988-05-30 00:00:00 | 0.798348716932 | 16.23 | 17.4970634921  | 2016-06-21 09:52:23.887247 |
+---------------------+----------------+-------+----------------+----------------------------+
[20 rows x 5 columns]

Of course, the first 252 rows of the scores output don't have a moving average or Z-score. This is because the moving window does not have sufficient data for those observations.

To reveal the 30, lets say, more anomalous data points we can sort the scores SFrame as follows.


In [10]:
scores.sort('anomaly_score', ascending=False).print_rows(num_rows=30, max_row_width=100)


+---------------------+---------------+-------+----------------+----------------------------+
|         DATE        | anomaly_score | VALUE | moving_average |     model_update_time      |
+---------------------+---------------+-------+----------------+----------------------------+
| 1990-08-23 00:00:00 | 4.94908744385 | 32.35 | 18.6811111111  | 2016-06-21 09:52:23.887247 |
| 1990-08-06 00:00:00 | 4.90849694957 | 27.28 | 18.1406349206  | 2016-06-21 09:52:23.887247 |
| 1990-08-07 00:00:00 | 4.71149150733 | 27.35 | 18.1836111111  | 2016-06-21 09:52:23.887247 |
| 1996-04-11 00:00:00 | 4.53804346471 |  23.9 | 17.5057539683  | 2016-06-21 09:52:23.887247 |
| 1990-09-24 00:00:00 | 4.50692667542 | 40.75 | 19.9052777778  | 2016-06-21 09:52:23.887247 |
| 1990-08-24 00:00:00 | 4.46580793477 | 31.65 | 18.7421031746  | 2016-06-21 09:52:23.887247 |
| 1990-08-22 00:00:00 | 4.44080480957 | 30.45 | 18.6275396825  | 2016-06-21 09:52:23.887247 |
| 2014-10-15 00:00:00 | 4.19616613166 | 84.02 | 106.431111111  | 2016-06-21 09:52:23.887247 |
| 1990-09-26 00:00:00 |  4.1839983978 | 40.85 | 20.0845238095  | 2016-06-21 09:52:23.887247 |
| 1990-08-20 00:00:00 | 4.14741527256 |  28.9 |  18.533015873  | 2016-06-21 09:52:23.887247 |
| 1990-09-27 00:00:00 | 4.14695126066 | 41.45 | 20.1742857143  | 2016-06-21 09:52:23.887247 |
| 1990-09-25 00:00:00 | 4.14236776291 |  39.9 | 19.9973412698  | 2016-06-21 09:52:23.887247 |
| 1990-08-17 00:00:00 | 4.11387632257 | 28.45 | 18.4879761905  | 2016-06-21 09:52:23.887247 |
| 1990-08-21 00:00:00 | 4.05584517043 | 29.05 | 18.5799206349  | 2016-06-21 09:52:23.887247 |
| 2014-10-16 00:00:00 | 4.04172723541 | 84.02 | 106.329801587  | 2016-06-21 09:52:23.887247 |
| 1990-09-18 00:00:00 | 3.92508669645 | 35.95 | 19.6142063492  | 2016-06-21 09:52:23.887247 |
| 1990-09-28 00:00:00 | 3.91160087847 |  41.0 | 20.2650396825  | 2016-06-21 09:52:23.887247 |
| 1990-08-14 00:00:00 | 3.89409974502 |  27.1 | 18.3683333333  | 2016-06-21 09:52:23.887247 |
| 2014-10-14 00:00:00 | 3.88191121126 | 86.36 | 106.528055556  | 2016-06-21 09:52:23.887247 |
| 1990-09-21 00:00:00 | 3.80434125093 | 36.95 | 19.8278968254  | 2016-06-21 09:52:23.887247 |
| 1990-08-13 00:00:00 | 3.80306955823 | 26.63 | 18.3301190476  | 2016-06-21 09:52:23.887247 |
| 1990-09-17 00:00:00 | 3.79101202819 |  34.9 | 19.5463492063  | 2016-06-21 09:52:23.887247 |
| 1990-01-05 00:00:00 | 3.78091207137 | 23.13 | 18.3053968254  | 2016-06-21 09:52:23.887247 |
| 1990-08-10 00:00:00 | 3.76751744623 |  26.3 | 18.2932142857  | 2016-06-21 09:52:23.887247 |
| 1990-09-06 00:00:00 | 3.74504911422 | 32.15 | 19.1574603175  | 2016-06-21 09:52:23.887247 |
| 1996-04-10 00:00:00 | 3.74291343033 | 22.63 | 17.4898809524  | 2016-06-21 09:52:23.887247 |
| 2014-10-20 00:00:00 | 3.71890559552 | 84.42 | 106.133055556  | 2016-06-21 09:52:23.887247 |
| 2014-10-13 00:00:00 | 3.71085377874 | 87.82 | 106.618730159  | 2016-06-21 09:52:23.887247 |
| 2014-09-10 00:00:00 | 3.71057498153 | 96.26 |  108.03952381  | 2016-06-21 09:52:23.887247 |
| 1990-08-16 00:00:00 | 3.70834497293 |  27.2 | 18.4471031746  | 2016-06-21 09:52:23.887247 |
+---------------------+---------------+-------+----------------+----------------------------+
[7345 rows x 5 columns]

Of cource, a lot more anomalous observations may exist in the fred_dcoilbrenteu time series. A good way to make a final decision on that, is to look at the approximate distribution of the anomaly scores with the SArray.sketch_summary() tool, then get a threshold for the anomaly score with the sketch summary's quantile method. Here we declare the top 1% of the data to be anomalies, characterizing that way 71 data points as "anomalous".


In [11]:
sketch = scores['anomaly_score'].sketch_summary()
threshold = sketch.quantile(0.99)
anomalies = scores[scores['anomaly_score'] > threshold]
anomalies.print_rows(num_rows=30, max_row_width=100)


+---------------------+---------------+-------+----------------+----------------------------+
|         DATE        | anomaly_score | VALUE | moving_average |     model_update_time      |
+---------------------+---------------+-------+----------------+----------------------------+
| 1990-01-03 00:00:00 | 3.58789470296 | 22.65 | 18.2605555556  | 2016-06-21 09:52:23.887247 |
| 1990-01-04 00:00:00 | 3.37071772998 |  22.5 | 18.2835714286  | 2016-06-21 09:52:23.887247 |
| 1990-01-05 00:00:00 | 3.78091207137 | 23.13 | 18.3053968254  | 2016-06-21 09:52:23.887247 |
| 1990-08-06 00:00:00 | 4.90849694957 | 27.28 | 18.1406349206  | 2016-06-21 09:52:23.887247 |
| 1990-08-07 00:00:00 | 4.71149150733 | 27.35 | 18.1836111111  | 2016-06-21 09:52:23.887247 |
| 1990-08-08 00:00:00 | 3.41562168654 | 25.15 | 18.2249603175  | 2016-06-21 09:52:23.887247 |
| 1990-08-09 00:00:00 | 3.68825352627 |  25.9 | 18.2573015873  | 2016-06-21 09:52:23.887247 |
| 1990-08-10 00:00:00 | 3.76751744623 |  26.3 | 18.2932142857  | 2016-06-21 09:52:23.887247 |
| 1990-08-13 00:00:00 | 3.80306955823 | 26.63 | 18.3301190476  | 2016-06-21 09:52:23.887247 |
| 1990-08-14 00:00:00 | 3.89409974502 |  27.1 | 18.3683333333  | 2016-06-21 09:52:23.887247 |
| 1990-08-15 00:00:00 | 3.52081467148 | 26.53 | 18.4086111111  | 2016-06-21 09:52:23.887247 |
| 1990-08-16 00:00:00 | 3.70834497293 |  27.2 | 18.4471031746  | 2016-06-21 09:52:23.887247 |
| 1990-08-17 00:00:00 | 4.11387632257 | 28.45 | 18.4879761905  | 2016-06-21 09:52:23.887247 |
| 1990-08-20 00:00:00 | 4.14741527256 |  28.9 |  18.533015873  | 2016-06-21 09:52:23.887247 |
| 1990-08-21 00:00:00 | 4.05584517043 | 29.05 | 18.5799206349  | 2016-06-21 09:52:23.887247 |
| 1990-08-22 00:00:00 | 4.44080480957 | 30.45 | 18.6275396825  | 2016-06-21 09:52:23.887247 |
| 1990-08-23 00:00:00 | 4.94908744385 | 32.35 | 18.6811111111  | 2016-06-21 09:52:23.887247 |
| 1990-08-24 00:00:00 | 4.46580793477 | 31.65 | 18.7421031746  | 2016-06-21 09:52:23.887247 |
| 1990-09-03 00:00:00 | 3.56192123754 | 30.53 | 19.0050396825  | 2016-06-21 09:52:23.887247 |
| 1990-09-05 00:00:00 | 3.58166997412 | 31.23 | 19.1042857143  | 2016-06-21 09:52:23.887247 |
| 1990-09-06 00:00:00 | 3.74504911422 | 32.15 | 19.1574603175  | 2016-06-21 09:52:23.887247 |
| 1990-09-07 00:00:00 |  3.4340036975 | 31.45 | 19.2136904762  | 2016-06-21 09:52:23.887247 |
| 1990-09-10 00:00:00 | 3.34284674068 | 31.45 | 19.2674603175  | 2016-06-21 09:52:23.887247 |
| 1990-09-11 00:00:00 | 3.43243692257 |  32.1 | 19.3206349206  | 2016-06-21 09:52:23.887247 |
| 1990-09-14 00:00:00 | 3.50439445706 | 33.35 | 19.4850396825  | 2016-06-21 09:52:23.887247 |
| 1990-09-17 00:00:00 | 3.79101202819 |  34.9 | 19.5463492063  | 2016-06-21 09:52:23.887247 |
| 1990-09-18 00:00:00 | 3.92508669645 | 35.95 | 19.6142063492  | 2016-06-21 09:52:23.887247 |
| 1990-09-19 00:00:00 | 3.59236958756 | 35.08 |  19.686031746  | 2016-06-21 09:52:23.887247 |
| 1990-09-20 00:00:00 | 3.61984392861 | 35.65 | 19.7556746032  | 2016-06-21 09:52:23.887247 |
| 1990-09-21 00:00:00 | 3.80434125093 | 36.95 | 19.8278968254  | 2016-06-21 09:52:23.887247 |
+---------------------+---------------+-------+----------------+----------------------------+
[71 rows x 5 columns]

In the figure below, we plot the original FRED/DCOILBRENTEU time series of "Dollars per Barrel of Crude Oil Brent - Europe", its Moving Average across the years, and the data points that we found to be anomalous.


In [12]:
%matplotlib inline

plot_time_series(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
                 'Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]', label='FRED/DCOILBRENTEU')
plt.plot_date(scores['DATE'], scores['moving_average'], fmt='b-', tz='utc', lw=2, label='Moving Average')
plt.plot(anomalies['DATE'], anomalies['VALUE'], 'rx', markersize=12, markeredgewidth=1.3, label='Anomalies')
plt.legend(loc='upper left', prop={'size': 16})
plt.show()


Training a Bayesian Changepoints Model

In this second part of our analysis we train a Bayesian Changepoints model to reveal where any anomalies exist in the fred_dcoilbrenteu time series.


In [13]:
model_bayesian_changepoints = gl.anomaly_detection.bayesian_changepoints.\
create(fred_dcoilbrenteu,
       feature='VALUE',
       # avg trading days per year
       expected_runlength = 252,
       # avg trading days per fiscal quarter
       lag=63)

The primary output of the Moving Z-score model is the scores field. This TimeSeries object contains:

  • row id/time: ID of the corresponding row in the input dataset. Here the dataset is a TimeSeries object and the model returns the DATE timestamp. If it was an SFrame, this column would be filled with the row numbers of the input data.
  • changepoint_score: The probability that the given point is a changepoint. This value is in a range between 0 and 1.
  • VALUE: the recorded value of Dollars per Barrel of "Crude Oil Brent - Europe".
  • model update time: time that the model was updated. This is particularly useful for model updating.

In [14]:
scores2 = model_bayesian_changepoints.scores.to_sframe()
scores2.print_rows(num_rows=10, max_row_width=100)


+---------------------+-------------------+-------+----------------------------+
|         DATE        | changepoint_score | VALUE |     model_update_time      |
+---------------------+-------------------+-------+----------------------------+
| 1987-05-20 00:00:00 | 1.29686499738e-20 | 18.63 | 2016-06-21 09:54:08.274432 |
| 1987-05-21 00:00:00 | 2.74120099668e-20 | 18.45 | 2016-06-21 09:54:08.274432 |
| 1987-05-22 00:00:00 | 7.22591567726e-20 | 18.55 | 2016-06-21 09:54:08.274432 |
| 1987-05-25 00:00:00 | 2.38425982431e-19 |  18.6 | 2016-06-21 09:54:08.274432 |
| 1987-05-26 00:00:00 | 7.65491442957e-19 | 18.63 | 2016-06-21 09:54:08.274432 |
| 1987-05-27 00:00:00 | 8.12765511891e-19 |  18.6 | 2016-06-21 09:54:08.274432 |
| 1987-05-28 00:00:00 | 1.97680271459e-18 |  18.6 | 2016-06-21 09:54:08.274432 |
| 1987-05-29 00:00:00 | 6.92500314492e-18 | 18.58 | 2016-06-21 09:54:08.274432 |
| 1987-06-01 00:00:00 | 2.20387527272e-17 | 18.65 | 2016-06-21 09:54:08.274432 |
| 1987-06-02 00:00:00 | 1.32613041369e-16 | 18.68 | 2016-06-21 09:54:08.274432 |
+---------------------+-------------------+-------+----------------------------+
[7345 rows x 4 columns]

To reveal the 30, lets say, more anomalous data points we can sort the scores SFrame as follows.


In [15]:
scores2.sort('changepoint_score', ascending=False).print_rows(num_rows=30, max_row_width=100)


+---------------------+-------------------+-------+----------------------------+
|         DATE        | changepoint_score | VALUE |     model_update_time      |
+---------------------+-------------------+-------+----------------------------+
| 1995-06-19 00:00:00 |   0.90101518149   | 16.93 | 2016-06-21 09:54:08.274432 |
| 1993-06-11 00:00:00 |   0.597158089704  | 17.63 | 2016-06-21 09:54:08.274432 |
| 1999-11-09 00:00:00 |   0.507500739218  | 24.44 | 2016-06-21 09:54:08.274432 |
| 1996-09-02 00:00:00 |   0.445819040465  | 22.23 | 2016-06-21 09:54:08.274432 |
| 2002-12-16 00:00:00 |   0.433560636239  | 28.73 | 2016-06-21 09:54:08.274432 |
| 2000-12-06 00:00:00 |   0.401576517802  | 27.47 | 2016-06-21 09:54:08.274432 |
| 1998-06-09 00:00:00 |   0.392100642229  | 12.76 | 2016-06-21 09:54:08.274432 |
| 2000-12-05 00:00:00 |   0.388410685527  | 28.88 | 2016-06-21 09:54:08.274432 |
| 2000-05-08 00:00:00 |   0.387010858837  | 26.03 | 2016-06-21 09:54:08.274432 |
| 2015-08-03 00:00:00 |   0.372779927272  | 49.49 | 2016-06-21 09:54:08.274432 |
| 2003-03-18 00:00:00 |   0.370761901973  | 28.55 | 2016-06-21 09:54:08.274432 |
| 1999-11-08 00:00:00 |   0.338941060425  | 23.62 | 2016-06-21 09:54:08.274432 |
| 1998-06-10 00:00:00 |   0.318597625743  | 12.23 | 2016-06-21 09:54:08.274432 |
| 1988-06-13 00:00:00 |   0.304947019691  | 15.53 | 2016-06-21 09:54:08.274432 |
| 1993-11-25 00:00:00 |   0.304185792104  | 14.35 | 2016-06-21 09:54:08.274432 |
| 2003-03-19 00:00:00 |   0.292840027839  |  28.4 | 2016-06-21 09:54:08.274432 |
| 1988-09-02 00:00:00 |   0.28354659943   |  14.0 | 2016-06-21 09:54:08.274432 |
| 2015-07-31 00:00:00 |   0.282706359521  | 53.29 | 2016-06-21 09:54:08.274432 |
| 2002-12-13 00:00:00 |   0.280068754576  | 27.64 | 2016-06-21 09:54:08.274432 |
| 2000-05-09 00:00:00 |   0.275048989119  | 26.69 | 2016-06-21 09:54:08.274432 |
| 2004-01-05 00:00:00 |   0.274173861349  |  32.3 | 2016-06-21 09:54:08.274432 |
| 1988-06-10 00:00:00 |   0.266631063592  | 15.85 | 2016-06-21 09:54:08.274432 |
| 1992-07-01 00:00:00 |   0.264558117223  | 20.25 | 2016-06-21 09:54:08.274432 |
| 1993-11-24 00:00:00 |   0.259702344042  | 15.13 | 2016-06-21 09:54:08.274432 |
| 1998-02-11 00:00:00 |   0.257667049696  | 14.35 | 2016-06-21 09:54:08.274432 |
| 1988-09-05 00:00:00 |   0.251226755215  | 13.68 | 2016-06-21 09:54:08.274432 |
| 2015-12-08 00:00:00 |   0.234369442362  | 39.44 | 2016-06-21 09:54:08.274432 |
| 2003-05-15 00:00:00 |   0.231745591569  | 26.77 | 2016-06-21 09:54:08.274432 |
| 2015-12-07 00:00:00 |   0.230019109768  | 39.69 | 2016-06-21 09:54:08.274432 |
| 2004-02-25 00:00:00 |   0.227675955176  | 32.46 | 2016-06-21 09:54:08.274432 |
+---------------------+-------------------+-------+----------------------------+
[7345 rows x 4 columns]

One interesting thing is that if you look at the tail of scores, you will see a handful of missing values. These data points have insufficient data after them to compute lagged changepoint scores. The number of missing values in the tail of the dataset can be reduced by equally reducing the lag parameter in our learning model. However, the returned results will be less accurate. Alternatively, one can choose to update the model with new data.


In [16]:
scores2.tail(80).print_rows(num_rows=80, max_row_width=100)


+---------------------+-------------------+-------+----------------------------+
|         DATE        | changepoint_score | VALUE |     model_update_time      |
+---------------------+-------------------+-------+----------------------------+
| 2016-01-08 00:00:00 |  3.1848228973e-07 | 31.67 | 2016-06-21 09:54:08.274432 |
| 2016-01-11 00:00:00 | 1.75562959897e-07 | 30.14 | 2016-06-21 09:54:08.274432 |
| 2016-01-12 00:00:00 | 8.45743413791e-08 | 29.14 | 2016-06-21 09:54:08.274432 |
| 2016-01-13 00:00:00 |  3.5737942485e-08 | 28.58 | 2016-06-21 09:54:08.274432 |
| 2016-01-14 00:00:00 |  1.4244636465e-08 | 28.84 | 2016-06-21 09:54:08.274432 |
| 2016-01-15 00:00:00 | 6.62431079546e-09 |  28.8 | 2016-06-21 09:54:08.274432 |
| 2016-01-18 00:00:00 | 3.36583267973e-09 | 27.36 | 2016-06-21 09:54:08.274432 |
| 2016-01-19 00:00:00 | 1.88146995405e-09 | 27.36 | 2016-06-21 09:54:08.274432 |
| 2016-01-20 00:00:00 |  1.3459841952e-09 | 26.01 | 2016-06-21 09:54:08.274432 |
| 2016-01-21 00:00:00 | 1.59052181279e-09 | 27.59 | 2016-06-21 09:54:08.274432 |
| 2016-01-22 00:00:00 | 2.35563778684e-09 | 30.46 | 2016-06-21 09:54:08.274432 |
| 2016-01-25 00:00:00 | 2.12645301859e-09 | 29.82 | 2016-06-21 09:54:08.274432 |
| 2016-01-26 00:00:00 |  2.6608990333e-09 | 30.94 | 2016-06-21 09:54:08.274432 |
| 2016-01-27 00:00:00 | 2.57008678246e-09 | 31.83 | 2016-06-21 09:54:08.274432 |
| 2016-01-28 00:00:00 | 1.96378332976e-09 | 33.01 | 2016-06-21 09:54:08.274432 |
| 2016-01-29 00:00:00 | 1.16854984134e-09 | 33.14 | 2016-06-21 09:54:08.274432 |
| 2016-02-01 00:00:00 |  8.2651663924e-10 | 32.45 | 2016-06-21 09:54:08.274432 |
| 2016-02-02 00:00:00 |        None       | 30.98 | 2016-06-21 09:54:08.274432 |
| 2016-02-03 00:00:00 |        None       | 32.38 | 2016-06-21 09:54:08.274432 |
| 2016-02-04 00:00:00 |        None       | 32.76 | 2016-06-21 09:54:08.274432 |
| 2016-02-05 00:00:00 |        None       | 32.35 | 2016-06-21 09:54:08.274432 |
| 2016-02-08 00:00:00 |        None       | 31.64 | 2016-06-21 09:54:08.274432 |
| 2016-02-09 00:00:00 |        None       | 30.15 | 2016-06-21 09:54:08.274432 |
| 2016-02-10 00:00:00 |        None       | 29.64 | 2016-06-21 09:54:08.274432 |
| 2016-02-11 00:00:00 |        None       | 28.82 | 2016-06-21 09:54:08.274432 |
| 2016-02-12 00:00:00 |        None       |  31.8 | 2016-06-21 09:54:08.274432 |
| 2016-02-16 00:00:00 |        None       | 31.09 | 2016-06-21 09:54:08.274432 |
| 2016-02-17 00:00:00 |        None       | 33.21 | 2016-06-21 09:54:08.274432 |
| 2016-02-18 00:00:00 |        None       |  33.2 | 2016-06-21 09:54:08.274432 |
| 2016-02-19 00:00:00 |        None       | 31.66 | 2016-06-21 09:54:08.274432 |
| 2016-02-22 00:00:00 |        None       | 33.59 | 2016-06-21 09:54:08.274432 |
| 2016-02-23 00:00:00 |        None       |  31.9 | 2016-06-21 09:54:08.274432 |
| 2016-02-24 00:00:00 |        None       |  31.5 | 2016-06-21 09:54:08.274432 |
| 2016-02-25 00:00:00 |        None       | 32.83 | 2016-06-21 09:54:08.274432 |
| 2016-02-26 00:00:00 |        None       | 35.76 | 2016-06-21 09:54:08.274432 |
| 2016-02-29 00:00:00 |        None       | 35.92 | 2016-06-21 09:54:08.274432 |
| 2016-03-01 00:00:00 |        None       | 35.73 | 2016-06-21 09:54:08.274432 |
| 2016-03-02 00:00:00 |        None       | 36.38 | 2016-06-21 09:54:08.274432 |
| 2016-03-03 00:00:00 |        None       | 35.75 | 2016-06-21 09:54:08.274432 |
| 2016-03-04 00:00:00 |        None       | 37.61 | 2016-06-21 09:54:08.274432 |
| 2016-03-07 00:00:00 |        None       | 39.02 | 2016-06-21 09:54:08.274432 |
| 2016-03-08 00:00:00 |        None       | 39.16 | 2016-06-21 09:54:08.274432 |
| 2016-03-09 00:00:00 |        None       | 40.26 | 2016-06-21 09:54:08.274432 |
| 2016-03-10 00:00:00 |        None       | 38.63 | 2016-06-21 09:54:08.274432 |
| 2016-03-11 00:00:00 |        None       | 39.41 | 2016-06-21 09:54:08.274432 |
| 2016-03-14 00:00:00 |        None       | 38.06 | 2016-06-21 09:54:08.274432 |
| 2016-03-15 00:00:00 |        None       | 37.49 | 2016-06-21 09:54:08.274432 |
| 2016-03-16 00:00:00 |        None       | 38.38 | 2016-06-21 09:54:08.274432 |
| 2016-03-17 00:00:00 |        None       | 39.29 | 2016-06-21 09:54:08.274432 |
| 2016-03-18 00:00:00 |        None       | 39.26 | 2016-06-21 09:54:08.274432 |
| 2016-03-21 00:00:00 |        None       | 39.91 | 2016-06-21 09:54:08.274432 |
| 2016-03-22 00:00:00 |        None       | 40.54 | 2016-06-21 09:54:08.274432 |
| 2016-03-23 00:00:00 |        None       | 38.84 | 2016-06-21 09:54:08.274432 |
| 2016-03-24 00:00:00 |        None       | 38.33 | 2016-06-21 09:54:08.274432 |
| 2016-03-28 00:00:00 |        None       | 38.33 | 2016-06-21 09:54:08.274432 |
| 2016-03-29 00:00:00 |        None       | 36.75 | 2016-06-21 09:54:08.274432 |
| 2016-03-30 00:00:00 |        None       | 36.75 | 2016-06-21 09:54:08.274432 |
| 2016-03-31 00:00:00 |        None       | 36.75 | 2016-06-21 09:54:08.274432 |
| 2016-04-01 00:00:00 |        None       | 36.42 | 2016-06-21 09:54:08.274432 |
| 2016-04-04 00:00:00 |        None       | 36.05 | 2016-06-21 09:54:08.274432 |
| 2016-04-05 00:00:00 |        None       | 35.88 | 2016-06-21 09:54:08.274432 |
| 2016-04-06 00:00:00 |        None       | 37.77 | 2016-06-21 09:54:08.274432 |
| 2016-04-07 00:00:00 |        None       | 37.15 | 2016-06-21 09:54:08.274432 |
| 2016-04-08 00:00:00 |        None       | 40.71 | 2016-06-21 09:54:08.274432 |
| 2016-04-11 00:00:00 |        None       | 41.58 | 2016-06-21 09:54:08.274432 |
| 2016-04-12 00:00:00 |        None       | 43.02 | 2016-06-21 09:54:08.274432 |
| 2016-04-13 00:00:00 |        None       | 42.81 | 2016-06-21 09:54:08.274432 |
| 2016-04-14 00:00:00 |        None       | 43.02 | 2016-06-21 09:54:08.274432 |
| 2016-04-15 00:00:00 |        None       | 41.32 | 2016-06-21 09:54:08.274432 |
| 2016-04-18 00:00:00 |        None       | 41.64 | 2016-06-21 09:54:08.274432 |
| 2016-04-19 00:00:00 |        None       | 43.02 | 2016-06-21 09:54:08.274432 |
| 2016-04-20 00:00:00 |        None       | 43.09 | 2016-06-21 09:54:08.274432 |
| 2016-04-21 00:00:00 |        None       | 43.48 | 2016-06-21 09:54:08.274432 |
| 2016-04-22 00:00:00 |        None       | 43.97 | 2016-06-21 09:54:08.274432 |
| 2016-04-25 00:00:00 |        None       | 42.97 | 2016-06-21 09:54:08.274432 |
| 2016-04-26 00:00:00 |        None       | 43.94 | 2016-06-21 09:54:08.274432 |
| 2016-04-27 00:00:00 |        None       | 44.17 | 2016-06-21 09:54:08.274432 |
| 2016-04-28 00:00:00 |        None       |  45.6 | 2016-06-21 09:54:08.274432 |
| 2016-04-29 00:00:00 |        None       | 45.64 | 2016-06-21 09:54:08.274432 |
| 2016-05-02 00:00:00 |        None       | 45.82 | 2016-06-21 09:54:08.274432 |
+---------------------+-------------------+-------+----------------------------+
[80 rows x 4 columns]

Of cource, a lot more anomalous observations may exist in the fred_dcoilbrenteu time series. A good way to make a final decision on that, is to look at the approximate distribution of the changepoint scores with the SArray.sketch_summary() tool, then get a threshold for the changepoint score with the sketch summary's quantile method. Again, we declare the top 1% of the data to be anomalies, characterizing that way 75 data points as "anomalous".


In [17]:
sketch2 = scores2['changepoint_score'].sketch_summary()
threshold2 = sketch2.quantile(0.99)
changepoints = scores2[scores2['changepoint_score'] > threshold2]
changepoints.print_rows(num_rows=105, max_row_width=100)


+---------------------+-------------------+--------+----------------------------+
|         DATE        | changepoint_score | VALUE  |     model_update_time      |
+---------------------+-------------------+--------+----------------------------+
| 1988-06-10 00:00:00 |   0.266631063592  | 15.85  | 2016-06-21 09:54:08.274432 |
| 1988-06-13 00:00:00 |   0.304947019691  | 15.53  | 2016-06-21 09:54:08.274432 |
| 1988-09-01 00:00:00 |   0.204701731073  | 14.15  | 2016-06-21 09:54:08.274432 |
| 1988-09-02 00:00:00 |   0.28354659943   |  14.0  | 2016-06-21 09:54:08.274432 |
| 1988-09-05 00:00:00 |   0.251226755215  | 13.68  | 2016-06-21 09:54:08.274432 |
| 1990-09-05 00:00:00 |   0.203866663955  | 31.23  | 2016-06-21 09:54:08.274432 |
| 1990-09-06 00:00:00 |   0.160739419458  | 32.15  | 2016-06-21 09:54:08.274432 |
| 1992-06-30 00:00:00 |   0.193569230058  |  20.6  | 2016-06-21 09:54:08.274432 |
| 1992-07-01 00:00:00 |   0.264558117223  | 20.25  | 2016-06-21 09:54:08.274432 |
| 1992-07-06 00:00:00 |   0.177006784278  | 20.45  | 2016-06-21 09:54:08.274432 |
| 1993-06-11 00:00:00 |   0.597158089704  | 17.63  | 2016-06-21 09:54:08.274432 |
| 1993-11-23 00:00:00 |   0.164084644965  | 15.25  | 2016-06-21 09:54:08.274432 |
| 1993-11-24 00:00:00 |   0.259702344042  | 15.13  | 2016-06-21 09:54:08.274432 |
| 1993-11-25 00:00:00 |   0.304185792104  | 14.35  | 2016-06-21 09:54:08.274432 |
| 1995-06-19 00:00:00 |   0.90101518149   | 16.93  | 2016-06-21 09:54:08.274432 |
| 1996-09-02 00:00:00 |   0.445819040465  | 22.23  | 2016-06-21 09:54:08.274432 |
| 1996-09-04 00:00:00 |   0.148184151318  |  22.2  | 2016-06-21 09:54:08.274432 |
| 1998-02-10 00:00:00 |   0.184069914828  |  14.6  | 2016-06-21 09:54:08.274432 |
| 1998-02-11 00:00:00 |   0.257667049696  | 14.35  | 2016-06-21 09:54:08.274432 |
| 1998-02-12 00:00:00 |   0.216443535863  | 14.04  | 2016-06-21 09:54:08.274432 |
| 1998-06-09 00:00:00 |   0.392100642229  | 12.76  | 2016-06-21 09:54:08.274432 |
| 1998-06-10 00:00:00 |   0.318597625743  | 12.23  | 2016-06-21 09:54:08.274432 |
| 1998-11-06 00:00:00 |   0.15961769082   | 11.51  | 2016-06-21 09:54:08.274432 |
| 1998-11-09 00:00:00 |   0.156372449682  | 11.15  | 2016-06-21 09:54:08.274432 |
| 1999-03-17 00:00:00 |   0.14951289282   | 12.95  | 2016-06-21 09:54:08.274432 |
| 1999-11-08 00:00:00 |   0.338941060425  | 23.62  | 2016-06-21 09:54:08.274432 |
| 1999-11-09 00:00:00 |   0.507500739218  | 24.44  | 2016-06-21 09:54:08.274432 |
| 2000-05-08 00:00:00 |   0.387010858837  | 26.03  | 2016-06-21 09:54:08.274432 |
| 2000-05-09 00:00:00 |   0.275048989119  | 26.69  | 2016-06-21 09:54:08.274432 |
| 2000-08-17 00:00:00 |   0.174294701123  | 30.71  | 2016-06-21 09:54:08.274432 |
| 2000-08-18 00:00:00 |   0.157135325246  | 30.76  | 2016-06-21 09:54:08.274432 |
| 2000-12-05 00:00:00 |   0.388410685527  | 28.88  | 2016-06-21 09:54:08.274432 |
| 2000-12-06 00:00:00 |   0.401576517802  | 27.47  | 2016-06-21 09:54:08.274432 |
| 2002-03-05 00:00:00 |   0.150810686174  | 22.25  | 2016-06-21 09:54:08.274432 |
| 2002-03-07 00:00:00 |   0.202847796813  |  23.1  | 2016-06-21 09:54:08.274432 |
| 2002-12-13 00:00:00 |   0.280068754576  | 27.64  | 2016-06-21 09:54:08.274432 |
| 2002-12-16 00:00:00 |   0.433560636239  | 28.73  | 2016-06-21 09:54:08.274432 |
| 2003-03-18 00:00:00 |   0.370761901973  | 28.55  | 2016-06-21 09:54:08.274432 |
| 2003-03-19 00:00:00 |   0.292840027839  |  28.4  | 2016-06-21 09:54:08.274432 |
| 2003-03-20 00:00:00 |   0.198289535597  |  28.0  | 2016-06-21 09:54:08.274432 |
| 2003-05-15 00:00:00 |   0.231745591569  | 26.77  | 2016-06-21 09:54:08.274432 |
| 2003-05-16 00:00:00 |   0.183622170335  | 27.18  | 2016-06-21 09:54:08.274432 |
| 2004-01-05 00:00:00 |   0.274173861349  |  32.3  | 2016-06-21 09:54:08.274432 |
| 2004-02-24 00:00:00 |   0.145932141919  |  31.6  | 2016-06-21 09:54:08.274432 |
| 2004-02-25 00:00:00 |   0.227675955176  | 32.46  | 2016-06-21 09:54:08.274432 |
| 2004-02-26 00:00:00 |   0.157042339181  | 32.45  | 2016-06-21 09:54:08.274432 |
| 2005-02-22 00:00:00 |   0.19887720275   |  47.6  | 2016-06-21 09:54:08.274432 |
| 2005-02-23 00:00:00 |   0.18508835355   | 48.16  | 2016-06-21 09:54:08.274432 |
| 2005-06-17 00:00:00 |   0.227186699947  | 56.92  | 2016-06-21 09:54:08.274432 |
| 2006-09-07 00:00:00 |   0.194714296519  | 64.52  | 2016-06-21 09:54:08.274432 |
| 2006-09-08 00:00:00 |   0.190396625279  |  64.3  | 2016-06-21 09:54:08.274432 |
| 2006-09-11 00:00:00 |   0.172616457261  | 62.41  | 2016-06-21 09:54:08.274432 |
| 2007-03-23 00:00:00 |   0.199160260612  |  63.1  | 2016-06-21 09:54:08.274432 |
| 2007-03-26 00:00:00 |   0.199804203991  | 64.43  | 2016-06-21 09:54:08.274432 |
| 2007-10-11 00:00:00 |   0.152539683274  | 80.83  | 2016-06-21 09:54:08.274432 |
| 2007-10-12 00:00:00 |   0.177569784884  | 80.82  | 2016-06-21 09:54:08.274432 |
| 2007-10-15 00:00:00 |   0.21383080324   |  82.5  | 2016-06-21 09:54:08.274432 |
| 2007-10-16 00:00:00 |   0.168233339423  | 84.43  | 2016-06-21 09:54:08.274432 |
| 2008-04-15 00:00:00 |   0.154493604648  | 110.84 | 2016-06-21 09:54:08.274432 |
| 2008-11-17 00:00:00 |   0.153328334889  | 50.82  | 2016-06-21 09:54:08.274432 |
| 2008-11-18 00:00:00 |   0.219983335284  |  49.1  | 2016-06-21 09:54:08.274432 |
| 2008-11-19 00:00:00 |   0.17690399955   | 48.35  | 2016-06-21 09:54:08.274432 |
| 2009-05-26 00:00:00 |   0.157187248804  | 59.05  | 2016-06-21 09:54:08.274432 |
| 2009-05-27 00:00:00 |   0.16463407354   | 61.28  | 2016-06-21 09:54:08.274432 |
| 2010-12-01 00:00:00 |   0.158144471336  | 88.56  | 2016-06-21 09:54:08.274432 |
| 2010-12-02 00:00:00 |   0.147271966207  | 89.37  | 2016-06-21 09:54:08.274432 |
| 2014-12-05 00:00:00 |   0.144973533274  |  68.0  | 2016-06-21 09:54:08.274432 |
| 2014-12-08 00:00:00 |   0.216839439305  | 65.64  | 2016-06-21 09:54:08.274432 |
| 2014-12-09 00:00:00 |   0.171399726561  | 66.11  | 2016-06-21 09:54:08.274432 |
| 2014-12-10 00:00:00 |   0.174043288333  | 63.32  | 2016-06-21 09:54:08.274432 |
| 2015-07-31 00:00:00 |   0.282706359521  | 53.29  | 2016-06-21 09:54:08.274432 |
| 2015-08-03 00:00:00 |   0.372779927272  | 49.49  | 2016-06-21 09:54:08.274432 |
| 2015-12-07 00:00:00 |   0.230019109768  | 39.69  | 2016-06-21 09:54:08.274432 |
| 2015-12-08 00:00:00 |   0.234369442362  | 39.44  | 2016-06-21 09:54:08.274432 |
| 2015-12-09 00:00:00 |   0.190248605256  | 39.04  | 2016-06-21 09:54:08.274432 |
+---------------------+-------------------+--------+----------------------------+
[75 rows x 4 columns]

In the figure below, we plot the original FRED/DCOILBRENTEU time series of "Dollars per Barrel of Crude Oil Brent - Europe", its Moving Average across the years, and the data points that we found to be anomalous with both the Moving Z-Score and the Bayesian Changepoint model.


In [18]:
%matplotlib inline

plt.rcParams['figure.figsize'] = 14, 24
plt.figure(1)

plt.subplot(3,1,1)
plt.plot_date(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
              fmt='g-', tz='utc', label='FRED/DCOILBRENTEU')
plt.plot_date(scores['DATE'], scores['moving_average'],\
              fmt='b-', tz='utc', lw=2, label='Moving Average')
plt.xlabel('Year')
plt.ylabel('Dollars per Barrel')
plt.title('Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
plt.rcParams.update({'font.size': 16})
plt.plot(anomalies['DATE'], anomalies['VALUE'],\
         'bx', markersize=12, markeredgewidth=1.3, label='Anomalies [Moving Z-Score]')
plt.legend(loc='upper left', prop={'size': 16})

plt.subplot(3,1,2)
plt.plot_date(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
              fmt='g-', tz='utc', label='FRED/DCOILBRENTEU')
plt.plot_date(scores['DATE'], scores['moving_average'],\
              fmt='b-', tz='utc', lw=2, label='Moving Average')
plt.xlabel('Year')
plt.ylabel('Dollars per Barrel')
plt.title('Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
plt.rcParams.update({'font.size': 16})
plt.plot(changepoints['DATE'], changepoints['VALUE'],\
         'rx', markersize=12, markeredgewidth=1.3, label='Anomalies [Bayesian Changepoints]')
plt.legend(loc='upper left', prop={'size': 16})

plt.subplot(3,1,3)
plt.plot_date(scores2['DATE'], scores2['changepoint_score'],\
              fmt='r-', tz='utc', lw=2, label='Bayesian Changepoint Probability')
plt.rcParams.update({'font.size': 16})
plt.xlabel('Year')
plt.ylabel('Changepoint Probability')
plt.title('Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
plt.legend(loc='upper left', prop={'size': 16})

plt.show()


Interestingly enough, the Bayesian Changepoint model revealed some new anomalous points which were obviously missed by the Moving Z-Score algorithm. More specifically, new anomalous data points observed during the periods 1998-2000 and 2006-2010. More importantly, the great depreciation of Crude Oil Prices (Brent-Europe) during the recent great recession (2007-2008) is now characterized an an additional anomalous point which is of course true.


In [ ]: